When you complete this session, you will be able to:
- understand variable selection in multiple linear regression (MLR);
- divide a dataset into a training set and a test set, so that models can be constructed on the former and their predictive ability tested on the latter;
- calculate internal measures of predictive ability, such as PRESS (the predicted residual sum of squares);
- understand multicollinearity.
The Boston data frame has 506 rows and 14 columns, giving housing values in suburbs of Boston. The dataset is available in the MASS R package.
The aim is to predict housing values in suburbs of Boston based on 13 explanatory variables in the dataset.
This data frame contains the following columns:
- crim: per capita crime rate by town.
- zn: proportion of residential land zoned for lots over 25,000 sq.ft.
- indus: proportion of non-retail business acres per town.
- chas: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).
- nox: nitrogen oxides concentration (parts per 10 million).
- rm: average number of rooms per dwelling.
- age: proportion of owner-occupied units built prior to 1940.
- dis: weighted mean of distances to five Boston employment centres.
- rad: index of accessibility to radial highways.
- tax: full-value property-tax rate per $10,000.
- ptratio: pupil-teacher ratio by town.
- black: 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town.
- lstat: lower status of the population (percent).
- medv: median value of owner-occupied homes in $1000s.
Answer the following 8 tasks.
Hint. Use R to retrieve the data as follows. The dataset is in the MASS R package; install that package first, then load it and the data.
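A minimal sketch of the retrieval step (run `install.packages("MASS")` once beforehand if the package is not yet installed); the output of `head(Boston)` is shown below:

```r
library(MASS)   # the Boston data frame ships with the MASS package
data(Boston)
dim(Boston)     # 506 rows, 14 columns
head(Boston)
```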
crim zn indus chas nox rm age dis rad tax ptratio black lstat
1 0.00632 18 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90 4.98
2 0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90 9.14
3 0.02729 0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83 4.03
4 0.03237 0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63 2.94
5 0.06905 0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90 5.33
6 0.02985 0 2.18 0 0.458 6.430 58.7 6.0622 3 222 18.7 394.12 5.21
medv
1 24.0
2 21.6
3 34.7
4 33.4
5 36.2
6 28.7
The response variable is medv. We know that chas and rad are categorical.
[1] 506 12
The variables most strongly correlated with medv are rm (+0.70) and lstat (-0.74).
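A sketch of how these figures can be obtained. Here Boston1 is assumed to be Boston with the categorical variables chas and rad dropped, which matches the dimensions [1] 506 12 above:

```r
library(MASS)
# Drop the categorical variables chas and rad (assumed definition of Boston1)
Boston1 <- Boston[, !(names(Boston) %in% c("chas", "rad"))]
dim(Boston1)
# Correlations of each remaining variable with the response medv
round(sort(cor(Boston1)[, "medv"]), 2)
```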
Split Boston1 into a training set and a test set so that the training set contains 80% of the data in Boston1 and the test set contains 20%. Show how you would do that. Call your training and test sets Train.Boston1 and Test.Boston1, respectively.

set.seed(2401)
TestIndex <- sample(nrow(Boston1), floor(0.2*nrow(Boston1)))
Test.Boston1 <- Boston1[TestIndex, ]
Train.Boston1 <- Boston1[-TestIndex, ]

Based on your (partial) exploratory analysis above in Task 2 (and perhaps additional analysis), choose only one explanatory variable with which to build a simple linear regression using Train.Boston1.
The variable rm: it has the strongest positive correlation (+0.70) with medv.
ii. Fit a simple linear regression with this variable using the training set. Output a summary of the linear model. What is the value of the estimated slope?
Call:
lm(formula = medv ~ rm, data = Train.Boston1)
Residuals:
Min 1Q Median 3Q Max
-21.691 -2.664 0.014 3.084 38.740
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -30.9149 3.0108 -10.27 <2e-16 ***
rm 8.4858 0.4771 17.79 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 6.916 on 403 degrees of freedom
Multiple R-squared: 0.4398, Adjusted R-squared: 0.4384
F-statistic: 316.4 on 1 and 403 DF, p-value: < 2.2e-16
The value of the estimated slope is 8.4858.
iii. Output diagnostic plots and comment on whether you think they indicate
any inadequacies in your linear model. You only need to comment on the first
three of these 'standard' diagnostic plots.
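The 'standard' diagnostic plots come from calling plot() on the fitted lm object, with which = 1:3 selecting the first three. A self-contained sketch on the built-in cars data (the same call would be applied to the simple regression fitted on Train.Boston1; fit is just a stand-in name):

```r
fit <- lm(dist ~ speed, data = cars)   # stand-in for the medv ~ rm fit
par(mfrow = c(1, 3))                   # three panels side by side
plot(fit, which = 1:3)                 # residuals vs fitted, normal Q-Q, scale-location
```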
This is an example interpretation; it depends on the training set.
- Linearity: not satisfied, as the smoothing curve in the residuals-vs-fitted plot looks nonlinear.
- Normality: many points in the upper and lower tails lie off the normal reference line, so normality is not satisfied.
- Constant variance: the third (scale-location) plot shows non-constant variance; the spread decreases for fitted values below about 20, then increases above 20.
There are some possible outliers (standardized residuals outside (-4, 4)) and many leverage points (good or bad).
Fit a multiple linear regression using all of the explanatory variables in the training set. Comment on the fitted model, specifically on which predictors are significant at a 5% significance level.
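The two summaries below are for the full model and the intercept-only model. A sketch of how they are fitted, recreating the split from the earlier task; model.all and model.1 are the names used in the step() call later:

```r
library(MASS)
# Recreate Boston1 and the 80/20 split (assumed to match the earlier steps)
Boston1 <- Boston[, !(names(Boston) %in% c("chas", "rad"))]
set.seed(2401)
TestIndex <- sample(nrow(Boston1), floor(0.2 * nrow(Boston1)))
Train.Boston1 <- Boston1[-TestIndex, ]

model.all <- lm(medv ~ ., data = Train.Boston1)  # all 11 predictors
model.1   <- lm(medv ~ 1, data = Train.Boston1)  # intercept only
summary(model.all)
summary(model.1)
```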
Call:
lm(formula = medv ~ ., data = Train.Boston1)
Residuals:
Min 1Q Median 3Q Max
-12.317 -3.028 -0.592 1.969 26.668
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 36.125301 5.835100 6.191 1.51e-09 ***
crim -0.071607 0.045742 -1.565 0.11828
zn 0.019456 0.017801 1.093 0.27507
indus -0.085737 0.073655 -1.164 0.24511
nox -17.974482 4.565558 -3.937 9.76e-05 ***
rm 3.663869 0.478957 7.650 1.57e-13 ***
age 0.007483 0.016062 0.466 0.64157
dis -1.570527 0.241110 -6.514 2.25e-10 ***
tax 0.002344 0.002845 0.824 0.41045
ptratio -0.925944 0.155434 -5.957 5.70e-09 ***
black 0.008738 0.003218 2.715 0.00691 **
lstat -0.558664 0.060083 -9.298 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 5.131 on 393 degrees of freedom
Multiple R-squared: 0.6993, Adjusted R-squared: 0.6909
F-statistic: 83.08 on 11 and 393 DF, p-value: < 2.2e-16
Call:
lm(formula = medv ~ 1, data = Train.Boston1)
Residuals:
Min 1Q Median 3Q Max
-17.287 -5.687 -1.287 2.713 27.713
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 22.2874 0.4586 48.6 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 9.229 on 404 degrees of freedom
# Forward selection (trace = 0 suppresses the intermediate output)
model.f <- step(model.1, scope = formula(model.all), direction = "forward", trace = 0)

The final model chosen by forward selection is:
Call:
lm(formula = medv ~ lstat + rm + ptratio + dis + nox + black,
data = Train.Boston1)
Residuals:
Min 1Q Median 3Q Max
-12.2605 -3.1328 -0.7291 1.8675 26.9514
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 36.848832 5.646546 6.526 2.06e-10 ***
lstat -0.570857 0.053838 -10.603 < 2e-16 ***
rm 3.792846 0.459828 8.248 2.37e-15 ***
ptratio -0.996059 0.129305 -7.703 1.07e-13 ***
dis -1.394089 0.197644 -7.054 7.78e-12 ***
nox -18.726505 3.857569 -4.854 1.74e-06 ***
black 0.009171 0.003111 2.948 0.00339 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 5.13 on 398 degrees of freedom
Multiple R-squared: 0.6956, Adjusted R-squared: 0.691
F-statistic: 151.6 on 6 and 398 DF, p-value: < 2.2e-16
The predictors that are significant are lstat, rm, ptratio, dis, nox and black.
This is an example interpretation; it depends on the training set.
- Linearity: not satisfied, as the smoothing curve shows clear curvature.
- Normality: some points in the upper tail lie off the normal reference line, so normality may or may not be satisfied.
- Constant variance: the third plot shows non-constant variance; the spread decreases for fitted values below about 20, then increases above 20.
There are some possible outliers (standardized residuals outside (-4, 4)) and many leverage points (good or bad).
[1] 4.210982
[1] 11064.88
Loading BodyFat.RData gives two data frames:
[1] "TrainBodyFat1" "TestBodyFat1"
[1] 221 14
[1] 24 14
A training set and a test set!
Note that because there are 14 variables in the dataset, we need to
make the scatterplot matrix a bit bigger than the default size, and we
can do so by using the chunk options fig.width and
fig.height. Note these are knitr/Rmarkdown
options, not R code.
It’s a good idea to plot the test set as well, because we want to make sure that there aren’t any outliers that might inflate prediction error.
In the training set, it’s clear that all of the variables are highly correlated - not surprising considering that body measurements tend to go up and down together.
We can see that in the training set there are a couple of individuals who have very large ankle measurements relative to other body measurements. No such outliers appear to exist in the test set.
There are lots of ways of removing the two individuals outlined
above, but a convenient way is to use the function subset.
See the help file for more details on how to use it. The two individuals
to remove have ankle measurements of greater than 30 cm, and we will use
this information to keep only those individuals whose ankle measurements
are less than 30 cm.
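A hypothetical toy example of the subset() call pattern (the real data uses TrainBodyFat1 and its Ankle column; df and its values here are made up purely for illustration):

```r
# Toy data frame standing in for the body-fat training set
df <- data.frame(Ankle   = c(22, 24, 33, 21, 38),
                 bodyfat = c(18, 20, 25, 15, 30))
df2 <- subset(df, Ankle < 30)   # keep only rows with Ankle below 30 cm
nrow(df2)                       # 3 of the 5 rows remain
# For the real data: TrainBodyFat1 <- subset(TrainBodyFat1, Ankle < 30)
```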
[1] 221 14
[1] 219 14
The reason we’re going to construct the simplest model possible, and the model with the most number of variables, is so that we can use them in the simplest of all sequential selection methods, forward stepwise regression. Have a look at the lecture notes for a fuller explanation of how it works.
Call:
lm(formula = bodyfat ~ ., data = TrainBodyFat1)
Residuals:
Min 1Q Median 3Q Max
-11.1718 -3.0671 0.0207 3.0749 9.5016
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.99231 26.11482 0.153 0.87865
Age 0.06523 0.03595 1.815 0.07104 .
Weight -0.02598 0.07343 -0.354 0.72385
Height -0.28600 0.20980 -1.363 0.17432
Neck -0.39758 0.25827 -1.539 0.12525
Chest -0.09845 0.11844 -0.831 0.40679
Abdomen 0.90690 0.10092 8.986 < 2e-16 ***
Hip -0.16953 0.16045 -1.057 0.29195
Thigh 0.20533 0.15989 1.284 0.20053
Knee -0.09147 0.27593 -0.331 0.74061
Ankle 0.13236 0.40734 0.325 0.74556
Biceps 0.18373 0.18570 0.989 0.32362
Forearm 0.28940 0.21856 1.324 0.18694
Wrist -1.69147 0.62595 -2.702 0.00746 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.353 on 205 degrees of freedom
Multiple R-squared: 0.7339, Adjusted R-squared: 0.717
F-statistic: 43.49 on 13 and 205 DF, p-value: < 2.2e-16
Once we have constructed the models above, we can use them in the function for running forward stepwise selection. Setting the argument trace = 0 would suppress the intermediate output; here the default is kept so that each step is shown.
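A self-contained illustration of the call pattern on the built-in mtcars data (the body-fat run below builds lm0 and lmall from TrainBodyFat1 instead; lmall and lmfwd are the names used later in this document, while lm0 is an assumed name for the null model):

```r
lm0   <- lm(mpg ~ 1, data = mtcars)  # simplest possible model
lmall <- lm(mpg ~ ., data = mtcars)  # model with every predictor
# Forward stepwise selection between the two; trace = 0 would
# suppress the step-by-step AIC tables
lmfwd <- step(lm0, scope = formula(lmall), direction = "forward")
formula(lmfwd)
```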
Start: AIC=921.71
bodyfat ~ 1
Df Sum of Sq RSS AIC
+ Abdomen 1 9590.3 5008.0 689.41
+ Chest 1 6838.2 7760.1 785.32
+ Hip 1 5397.3 9201.1 822.62
+ Weight 1 5044.8 9553.5 830.85
+ Thigh 1 3850.0 10748.3 856.66
+ Neck 1 3359.1 11239.3 866.44
+ Knee 1 3191.5 11406.9 869.68
+ Biceps 1 2932.0 11666.3 874.61
+ Wrist 1 1669.5 12928.9 897.11
+ Forearm 1 1590.4 13007.9 898.45
+ Age 1 1561.6 13036.7 898.93
+ Ankle 1 1047.4 13550.9 907.41
<none> 14598.3 921.71
+ Height 1 73.6 14524.8 922.60
Step: AIC=689.41
bodyfat ~ Abdomen
Df Sum of Sq RSS AIC
+ Weight 1 714.23 4293.8 657.71
+ Height 1 581.33 4426.7 664.39
+ Wrist 1 573.39 4434.6 664.78
+ Neck 1 399.27 4608.7 673.21
+ Ankle 1 379.09 4628.9 674.17
+ Hip 1 356.78 4651.2 675.22
+ Knee 1 294.89 4713.1 678.12
+ Chest 1 198.06 4810.0 682.57
+ Age 1 154.17 4853.8 684.56
+ Thigh 1 123.85 4884.2 685.93
+ Forearm 1 101.28 4906.7 686.93
+ Biceps 1 100.30 4907.7 686.98
<none> 5008.0 689.41
Step: AIC=657.71
bodyfat ~ Abdomen + Weight
Df Sum of Sq RSS AIC
+ Wrist 1 146.804 4147.0 652.09
+ Thigh 1 70.001 4223.8 656.11
+ Neck 1 51.396 4242.4 657.07
+ Biceps 1 50.410 4243.4 657.13
+ Height 1 47.118 4246.7 657.30
<none> 4293.8 657.71
+ Forearm 1 22.411 4271.4 658.57
+ Ankle 1 4.656 4289.1 659.47
+ Chest 1 1.727 4292.1 659.62
+ Age 1 1.214 4292.6 659.65
+ Hip 1 0.715 4293.1 659.68
+ Knee 1 0.000 4293.8 659.71
Step: AIC=652.09
bodyfat ~ Abdomen + Weight + Wrist
Df Sum of Sq RSS AIC
+ Biceps 1 70.917 4076.1 650.32
+ Height 1 52.325 4094.7 651.31
+ Forearm 1 51.997 4095.0 651.33
<none> 4147.0 652.09
+ Thigh 1 35.113 4111.9 652.23
+ Age 1 22.963 4124.0 652.88
+ Neck 1 8.914 4138.1 653.62
+ Hip 1 3.772 4143.2 653.89
+ Ankle 1 2.619 4144.4 653.95
+ Knee 1 1.872 4145.1 653.99
+ Chest 1 0.194 4146.8 654.08
Step: AIC=650.32
bodyfat ~ Abdomen + Weight + Wrist + Biceps
Df Sum of Sq RSS AIC
<none> 4076.1 650.32
+ Age 1 25.9071 4050.2 650.92
+ Height 1 25.7919 4050.3 650.93
+ Neck 1 24.2258 4051.8 651.01
+ Forearm 1 22.8574 4053.2 651.08
+ Thigh 1 15.1465 4060.9 651.50
+ Ankle 1 7.8960 4068.2 651.89
+ Hip 1 3.2417 4072.8 652.14
+ Knee 1 1.9689 4074.1 652.21
+ Chest 1 1.8034 4074.3 652.22
Call:
lm(formula = bodyfat ~ Abdomen + Weight + Wrist + Biceps, data = TrainBodyFat1)
Residuals:
Min 1Q Median 3Q Max
-10.0819 -3.1413 -0.1155 3.3936 9.1313
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -32.55819 7.62550 -4.270 2.94e-05 ***
Abdomen 0.97858 0.05909 16.560 < 2e-16 ***
Weight -0.12828 0.02963 -4.330 2.29e-05 ***
Wrist -1.42443 0.48061 -2.964 0.00338 **
Biceps 0.31347 0.16245 1.930 0.05498 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.364 on 214 degrees of freedom
Multiple R-squared: 0.7208, Adjusted R-squared: 0.7156
F-statistic: 138.1 on 4 and 214 DF, p-value: < 2.2e-16
Remarkably, only four measurements are selected by the procedure. But let’s see how well it predicts body fat.
# Predictions from the model with all variables
allpred <- predict(lmall, newdata = TestBodyFat1)
# Predictions from the model selected by forward selection
fwdpred <- predict(lmfwd, newdata = TestBodyFat1)
Actual <- TestBodyFat1$bodyfat
RMSEP.all <- sqrt(sum((Actual - allpred)^2)/length(Actual)); RMSEP.all
[1] 3.391678
[1] 3.190346
The key message here is that more variables in a model don't necessarily lead to better predictions: the model with four variables yields a smaller RMSEP than the model with many more.
The Prostate Cancer data (Stamey et al. (1989)) examine the correlation between the level of prostate-specific antigen (PSA) and a number of clinical measures in 97 men who were about to receive a radical prostatectomy. The dataset is prostate.txt.
The goal is to predict the log of PSA lpsa from a number of measurements including log cancer volume lcavol, log prostate weight lweight, age, log of benign prostatic hyperplasia amount lbph, seminal vesicle invasion svi, log of capsular penetration lcp, Gleason score gleason, and percent of Gleason scores 4 or 5 pgg45.
Loading the dataset.
options(digits=3) ## limit the number of printed digits
prostate_data=read.table(file='prostate.txt',header=TRUE) # Read the data in
Men lcavol lweight age lbph svi lcp gleason pgg45 lpsa train
1 1 -0.580 2.77 50 -1.39 0 -1.39 6 0 -0.431 TRUE
2 2 -0.994 3.32 58 -1.39 0 -1.39 6 0 -0.163 TRUE
3 3 -0.511 2.69 74 -1.39 0 -1.39 7 20 -0.163 TRUE
4 4 -1.204 3.28 58 -1.39 0 -1.39 6 0 -0.163 TRUE
5 5 0.751 3.43 62 -1.39 0 -1.39 6 0 0.372 TRUE
6 6 -1.050 3.23 50 -1.39 0 -1.39 6 0 0.765 TRUE
'data.frame': 97 obs. of 11 variables:
$ Men : int 1 2 3 4 5 6 7 8 9 10 ...
$ lcavol : num -0.58 -0.994 -0.511 -1.204 0.751 ...
$ lweight: num 2.77 3.32 2.69 3.28 3.43 ...
$ age : int 50 58 74 58 62 50 64 58 47 63 ...
$ lbph : num -1.39 -1.39 -1.39 -1.39 -1.39 ...
$ svi : int 0 0 0 0 0 0 0 0 0 0 ...
$ lcp : num -1.39 -1.39 -1.39 -1.39 -1.39 ...
$ gleason: int 6 6 7 6 6 6 6 6 6 6 ...
$ pgg45 : int 0 0 20 0 0 0 0 0 0 0 ...
$ lpsa : num -0.431 -0.163 -0.163 -0.163 0.372 ...
$ train : logi TRUE TRUE TRUE TRUE TRUE TRUE ...
library(dplyr)  # for %>%, select() and filter()
prostate_data=prostate_data %>%  # Remove the ID variable
  dplyr::select(-Men)
prostate_data_train=prostate_data %>%  # Keep only the training rows
  filter(train==TRUE)
prostate_data_test=prostate_data %>%  # Keep only the testing rows
  filter(train==FALSE)
[1] 67 10
[1] 30 10
lcavol lweight age lbph svi lcp gleason pgg45 lpsa train
1 -0.580 2.77 50 -1.39 0 -1.39 6 0 -0.431 TRUE
2 -0.994 3.32 58 -1.39 0 -1.39 6 0 -0.163 TRUE
3 -0.511 2.69 74 -1.39 0 -1.39 7 20 -0.163 TRUE
4 -1.204 3.28 58 -1.39 0 -1.39 6 0 -0.163 TRUE
5 0.751 3.43 62 -1.39 0 -1.39 6 0 0.372 TRUE
6 -1.050 3.23 50 -1.39 0 -1.39 6 0 0.765 TRUE
# a more informative graph
library("PerformanceAnalytics")
chart.Correlation(prostate_data[,1:9],histogram=TRUE,pch=19)

We can also create the scatterplot matrix using the ggplot2 and GGally R packages.
We can see that lcavol (+0.73) and lweight (+0.43), and more weakly age (+0.17), are positively correlated with lpsa.
lcavol lweight age lbph svi lcp gleason pgg45 lpsa
lcavol 1.0000 0.2805 0.225 0.0273 0.5388 0.675 0.4324 0.4337 0.734
lweight 0.2805 1.0000 0.348 0.4423 0.1554 0.165 0.0569 0.1074 0.433
age 0.2250 0.3480 1.000 0.3502 0.1177 0.128 0.2689 0.2761 0.170
lbph 0.0273 0.4423 0.350 1.0000 -0.0858 -0.007 0.0778 0.0785 0.180
svi 0.5388 0.1554 0.118 -0.0858 1.0000 0.673 0.3204 0.4576 0.566
lcp 0.6753 0.1645 0.128 -0.0070 0.6731 1.000 0.5148 0.6315 0.549
gleason 0.4324 0.0569 0.269 0.0778 0.3204 0.515 1.0000 0.7519 0.369
pgg45 0.4337 0.1074 0.276 0.0785 0.4576 0.632 0.7519 1.0000 0.422
lpsa 0.7345 0.4333 0.170 0.1798 0.5662 0.549 0.3690 0.4223 1.000
This is important when predictors are on differing scales. We standardize by mean and standard deviation. We do NOT scale our response variable. In larger datasets, standardization also speeds up learning algorithms.
First we exclude lpsa and reorder the variables to be the same order as the reference.
predictors_train=prostate_data_train[ ,c('lcavol','lweight','age','lbph','svi','lcp','gleason','pgg45')]
predictors_test=prostate_data_test[ ,c('lcavol','lweight','age','lbph','svi','lcp','gleason','pgg45')]

# scale() standardizes each column to mean 0 and sd 1
predictors_scaled=as.data.frame(scale(predictors_train))
prostate_data_train=data.frame(prostate_data_train$lpsa,predictors_scaled)
names(prostate_data_train)= c('lpsa', 'lcavol', 'lweight', 'age', 'lbph', 'svi', 'lcp', 'gleason', 'pgg45')

predictors_scaled_test=as.data.frame(scale(predictors_test))
prostate_data_test=data.frame(prostate_data_test$lpsa,predictors_scaled_test)
names(prostate_data_test)= c('lpsa', 'lcavol', 'lweight', 'age', 'lbph', 'svi', 'lcp', 'gleason', 'pgg45')

library(leaps)  # for regsubsets
best_subsets <- regsubsets(lpsa~., data = prostate_data_train, nvmax = 8, nbest = 1)
summary(best_subsets)
Subset selection object
Call: regsubsets.formula(lpsa ~ ., data = prostate_data_train, nvmax = 8,
nbest = 1)
8 Variables (and intercept)
Forced in Forced out
lcavol FALSE FALSE
lweight FALSE FALSE
age FALSE FALSE
lbph FALSE FALSE
svi FALSE FALSE
lcp FALSE FALSE
gleason FALSE FALSE
pgg45 FALSE FALSE
1 subsets of each size up to 8
Selection Algorithm: exhaustive
lcavol lweight age lbph svi lcp gleason pgg45
1 ( 1 ) "*" " " " " " " " " " " " " " "
2 ( 1 ) "*" "*" " " " " " " " " " " " "
3 ( 1 ) "*" "*" " " " " "*" " " " " " "
4 ( 1 ) "*" "*" " " "*" "*" " " " " " "
5 ( 1 ) "*" "*" " " "*" "*" " " " " "*"
6 ( 1 ) "*" "*" " " "*" "*" "*" " " "*"
7 ( 1 ) "*" "*" "*" "*" "*" "*" " " "*"
8 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*"
The best subset containing two predictors consists of lcavol and lweight. One way to justify choosing such a small model is the "one standard error" rule, whereby we pick the simplest model whose error is within one standard error of that of the best model.
Note that the stars indicate which variables are included in the best subset of each size. The argument nbest returns the best nbest subsets of each size up to nvmax.
Call:
lm(formula = lpsa ~ lcavol + lweight, data = prostate_data_train)
Residuals:
Min 1Q Median 3Q Max
-1.589 -0.442 0.013 0.526 1.931
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.4523 0.0930 26.37 < 2e-16 ***
lcavol 0.7799 0.0982 7.94 4.1e-11 ***
lweight 0.3519 0.0982 3.58 0.00066 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.761 on 64 degrees of freedom
Multiple R-squared: 0.615, Adjusted R-squared: 0.603
F-statistic: 51.1 on 2 and 64 DF, p-value: 5.54e-14
# We need to calculate the RSS for the intercept-only model first
prostate_intercept=lm(lpsa~1,data=prostate_data_train)
intercept_pred=predict(prostate_intercept,newdata=prostate_data_train)
RSS_intercept=sum((intercept_pred-prostate_data_train$lpsa)^2)
RSS_intercept
[1] 96.3
plot(x=0:8,c(RSS_intercept,summary(best_subsets)$rss),xlim=c(0,8),ylim=c(0,100),xlab="Subset Size k",ylab="Residual Sum-of-Squares",col='red')
lines(x=0:8,c(RSS_intercept,summary(best_subsets)$rss),col='red')
9. Re-run regsubsets with nbest = 1, and then construct plots of adjusted \(R^2\), \(C_p\), and BIC against the number of variables. You will need to extract these criteria from the summary object.
# Modify some graphical parameters
par(mfrow = c(1, 3))
par(cex.axis = 1.5)
par(cex.lab = 1.5)
AllSubsets <- regsubsets(lpsa~., data = prostate_data_train, nvmax = 8,nbest=1)
AllSubsets.summary <- summary(AllSubsets)
plot(1:8, AllSubsets.summary$adjr2, xlab = "subset size", ylab = "adjusted R-squared", type = "b")
plot(1:8, AllSubsets.summary$cp, xlab = "subset size", ylab = "Mallows' Cp", type = "b")
abline(0,1,col=2)
plot(1:8, AllSubsets.summary$bic, xlab = "subset size", ylab = "BIC", type = "b")

This now gives the "best" choice of explanatory variables for models of size 1, 2, …, 8. So out of all models with only a single covariate, we would choose lcavol as the best predictor.
Adjusted \(R^2\) recommends p = 7; BIC recommends 2, 3 or 4. For Mallows' \(C_p\), you can add the reference line \(C_p = p\) to help locate the subset size satisfying \(C_p \approx p\); from the plot above, this is around 7.
As we saw in Lecture 10, one internal measure of predictive ability of a model is the PRESS statistic. The idea is to set aside a single observation \(y_i\) from the dataset, and then, using the model constructed from the remaining data, predict the value of the observation left out (\(\hat{y}_{-i}\)), and calculate the residual \(y_i - \hat{y}_{-i}\). Do this until all observations have been left out, and then calculate the sum of the squares of these residuals, i.e., \[ \mbox{PRESS} = \sum_{i=1}^n (y_i - \hat{y}_{-i})^2 \] As we saw in Lecture 10, for a linear regression, calculating \(\mbox{PRESS}\) doesn’t actually require this iterative procedure. Instead, \(\mbox{PRESS}\) can be calculated more easily using values computed from fitting the model to all the data, i.e., \[ \mbox{PRESS} = \sum_{i=1}^n \left(\frac{\hat{e}_i}{1 - h_{ii}}\right)^2 \] In the expression above, \(\hat{e}_i\) is the \(i\)th residual, and \(h_{ii}\) is the \(i\)th diagonal element of the ‘hat’ matrix \(H = X(X'X)^{-1}X'\).
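The equivalence of the two expressions can be checked directly. A self-contained example on the built-in cars data, comparing the closed-form PRESS with an explicit leave-one-out computation:

```r
fit <- lm(dist ~ speed, data = cars)

# Closed-form PRESS from the residuals and hat-matrix diagonals
press_closed <- sum((resid(fit) / (1 - hatvalues(fit)))^2)

# Explicit leave-one-out computation of the same quantity
press_loo <- 0
for (i in seq_len(nrow(cars))) {
  fit_i  <- lm(dist ~ speed, data = cars[-i, ])
  yhat_i <- predict(fit_i, newdata = cars[i, , drop = FALSE])
  press_loo <- press_loo + (cars$dist[i] - yhat_i)^2
}

all.equal(press_closed, unname(press_loo))  # the two agree
```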
The library DAAG (Maindonald, J.H. and Braun, W.J.
(2010) Data Analysis and Graphics Using R, 3rd ed. Cambridge:
Cambridge University Press) contains the function press
that you can use instead of coding the explicit expressions above.
If you do not have DAAG installed, you have to install
it first, and then load it into your R session using the
command library(DAAG) before you can invoke
press.
Compare the value of the \(\mbox{PRESS}\) statistic for the forward-selection and all-predictors models you constructed above in Question 2, namely lmfwd and lmall.
[1] 4269
[1] 4403
We revisit this dataset to include quadratic terms in the model.
The data frame Defects provides data on the average
number of defects per 1000 parts (Defective) produced in an
industrial process along with the values of other variables
(Temperature, Density, and Rate).
The production engineer wishes to construct a linear model relating
Defective to the potential predictors.
Use the pairs function to produce a scatterplot matrix of all the variables in Defects. Are the scatterplots of Defective against the other variables linear?

From the scatterplots, we can see that Defective has a strong positive relationship with each of the other variables, although there is visible curvature, suggesting these relationships are not linear (the relationships among Temperature, Density and Rate themselves, on the other hand, do look linear).
Explain the fitted model, by interpreting the R output. We have done this in Comp Lab Week 10.
Call:
lm(formula = Defective ~ Temperature + Density + Rate, data = defects)
Residuals:
Min 1Q Median 3Q Max
-12.737 -4.112 -0.575 2.762 16.328
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 10.324 65.926 0.16 0.877
Temperature 16.078 8.294 1.94 0.063 .
Density -1.827 1.497 -1.22 0.233
Rate 0.117 0.131 0.89 0.380
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 7.11 on 26 degrees of freedom
Multiple R-squared: 0.88, Adjusted R-squared: 0.866
F-statistic: 63.4 on 3 and 26 DF, p-value: 4.37e-12
\[\mbox{var}(\hat{\beta}_j)=\frac{1}{1-r_{12}^2} \frac{\sigma^2}{S_{x_j}^2}\] where \(r_{12}\) denotes the correlation between \(x_1\) and \(x_2\), and \(S_{x_j}\) denotes the standard deviation of \(x_j\). The term \(\frac{1}{1-r_{12}^2}\) is called a variance inflation factor (VIF).
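More generally, with several predictors the VIF for \(x_j\) is \(1/(1-R_j^2)\), where \(R_j^2\) comes from regressing \(x_j\) on the other predictors. A quick check of this definition on the built-in mtcars data (a hypothetical example; car::vif should give the same value for this linear model):

```r
# Regress one predictor on the others to get R_j^2, then
# compute VIF_j = 1 / (1 - R_j^2) by hand
r2_wt  <- summary(lm(wt ~ hp + disp, data = mtcars))$r.squared
vif_wt <- 1 / (1 - r2_wt)
vif_wt   # well above 1: wt is largely explained by hp and disp
```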
There are high correlations among the explanatory variables, as shown in (a), which suggests multicollinearity.
The variance inflation factor VIF can be calculated using the function ‘vif’ in the ‘car’ R package.
Temperature Density Rate
13.43 14.51 6.64
A number of these variance inflation factors exceed 5, the cut-off often used, and so the associated regression coefficients are poorly estimated due to multicollinearity.